Higher criticism, or second-level significance testing, is a
multiple-comparisons concept mentioned in passing by Tukey. It concerns a
situation where there are many independent tests of significance and
one is interested in rejecting the joint null hypothesis. Tukey
suggested comparing the fraction of observed significances at a given
$\alpha$-level to the expected fraction under the joint null. In fact, he
suggested standardizing the difference of the two quantities and forming
a z-score; the resulting z-score tests the significance of the body of
significance tests.

We consider a generalization, where we maximize this z-score over
a range of significance levels $0 < \alpha \le \alpha_0$. We are able to
show that the resulting higher criticism statistic is effective at
resolving a very subtle testing problem: testing whether n normal means
are all zero versus the alternative that a small fraction is nonzero.

The subtlety of this "sparse normal means" testing problem can 
be seen from work of Ingster and Jin, who studied such problems in 
great detail. In their studies, they identified an interesting range of 
cases where the small fraction of nonzero means is so small that the 
alternative hypothesis exhibits little noticeable effect on the distri- 
bution of the p-values either for the bulk of the tests or for the few 
most highly significant tests. In this range, when the amplitude of 
nonzero means is calibrated with the fraction of nonzero means, the 
likelihood ratio test for a precisely specified alternative would still 
succeed in separating the two hypotheses. 

We show that the higher criticism is successful throughout the
same region of amplitude sparsity where the likelihood ratio test
would succeed. Since it does not require a specification of the
alternative, this shows that higher criticism is in a sense optimally
adaptive to unknown sparsity and size of the nonnull effects. While
our theoretical work is largely asymptotic, we provide simulations in
finite samples and suggest some possible applications. We also show
that higher criticism works well over a range of non-Gaussian cases.



1. Introduction. In his Class Notes for Statistics 411 at Princeton Uni- 
versity in 1976 [31], Tukey introduced the notion of the higher criticism by 
means of a story. A young psychologist administers many hypothesis tests 
as part of a research project, and finds that, of 250 tests 11 were significant 
at the 5% level. The young researcher feels very proud of this fact and is 
ready to make a big deal about it, until a senior researcher (Tukey himself?) 
suggests that one would expect 12.5 significant tests even in the purely null 
case, merely by chance. In that sense, finding only 11 significant results is 
actually somewhat disappointing! 

Tukey used this story as a way to make vivid the notion of the higher 
criticism of such situations as multiple testing. He then proposed a sort of 
second-level significance testing, based on the statistic 

\[ \mathrm{HC}_{0.05,n} = \sqrt{n}\,[(\text{Fraction Significant at } 0.05) - 0.05]/\sqrt{0.05 \times 0.95}, \]

and suggested that values of (say) 2 or greater indicate a kind of significance 
of the overall body of tests. (The same statistic was proposed and applied in 
a psychometric trial by Brozek and Tiede [10] even earlier, but without the 
catchy name.) 
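As a small worked check of Tukey's statistic on the story above (11 significant results out of 250 tests at the 5% level), the following sketch evaluates the second-level z-score. The function name `second_level_z` is ours and purely illustrative.

```python
import math

def second_level_z(num_significant, num_tests, alpha=0.05):
    """Tukey's second-level z-score for a body of tests at one fixed level."""
    fraction = num_significant / num_tests
    # Standardize the observed fraction against its null mean and variance.
    return math.sqrt(num_tests) * (fraction - alpha) / math.sqrt(alpha * (1 - alpha))

z = second_level_z(11, 250)
print(round(z, 3))  # about -0.435: fewer significances than expected under the null
```

The value is negative and far from the suggested threshold of 2, which matches the senior researcher's remark that 11 significant results out of 250 are no evidence against the joint null.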

Although Tukey's discussion turned to other topics at that point, we may,
if we like, imagine that it had continued in this vein. We might then consider
not only significance at the 0.05 level, but perhaps at all levels between (say)
0 and $\alpha_0$, and so define



\[ \mathrm{HC}^*_n = \max_{0 < \alpha \le \alpha_0} \sqrt{n}\,[(\text{Fraction Significant at } \alpha) - \alpha]/\sqrt{\alpha \times (1 - \alpha)}. \]

In this paper, we will analyze a statistic of this kind in a setting where
there are a small fraction of nonnull hypotheses and derive an adaptive
optimality for it.

In our setting there are n independent tests of unrelated hypotheses, $H_{0,i}$
vs. $H_{1,i}$, where the test statistics $X_i$ obey

\[ H_{0,i}: X_i \sim N(0, 1), \]
\[ H_{1,i}: X_i \sim N(\mu_i, 1), \qquad \mu_i > 0. \]



In the overwhelming majority of the tests, the corresponding null hypothesis
is true (i.e., the corresponding normal mean $\mu_i = 0$), but some small fraction
may concern tests where the null hypothesis is false (and so the mean $\mu_i > 0$).
So the fraction of false null hypotheses, if nonzero, is small. Can we tell
reliably whether the fraction is actually 0 or not?

We mention three application areas where situations like this might arise: 

• Early detection of bioweapons use. Suppose that there are n observa- 
tional units in a certain geographical region, and for each one we have a 
z-score associated with the presence of a certain symptom at rates higher 
than background. If a bioweapon has been used in that region, then in 
early stages we do not expect all observational units to be affected, we do 
not know which ones might be affected, and we do not want to wait until 
some observational unit begins to display wildly elevated rates. We want 
to detect while a small fraction begins to show individually significant 
results but no unit yet shows jointly significant results. 

• Detection of covert communications. In a signals intelligence setting we 
suppose that a small fraction of the signal spectrum in a certain situa- 
tion may be used for covert communications, which would mean that a 
few frequencies exhibit increased power. However, we do not know what 
frequencies those might be, and the specific frequencies being used might 
change randomly from one epoch to another, so that we never get a very 
definite indication that we are definitely seeing increased power in any one 
specific frequency. Nevertheless, we might still want to detect the presence 
of a small fraction of frequencies with slightly increased power. 

• Meta-analysis with heterogeneity. We have results from n experiments 
testing a certain treatment. It turns out that an unidentified experimental 
factor is crucial to success, but that only in a small fraction of experiments 
is this factor fortuitously chosen so that the experimental performance 
follows a nonnull distribution. Can we reliably detect the presence of a 
small fraction of well-laid out experiments among many hopeless ones, 
when we do not know which ones may be well-laid out? 

There are many other potential applications in signal processing; see, for 
example, [19]-[21]. In spatial statistics Kendall and Kendall [25] developed 
a statistic closely related to HC* called "pontogram" for the purpose of 
detecting near-alignments in sets of points. 

1.1. The model, and the asymptotic detection boundary. Translating our
problem into precise terms, we begin by scrutinizing a special case where all
the nonzero $\mu_i$ are equal, and we can then model our data as providing n
i.i.d. observations from one of two possible situations:



\[ H_0: X_i \overset{\text{i.i.d.}}{\sim} N(0, 1), \qquad 1 \le i \le n, \tag{1.3} \]

\[ H_1^{(n)}: X_i \overset{\text{i.i.d.}}{\sim} (1 - \varepsilon) N(0, 1) + \varepsilon N(\mu, 1), \qquad 1 \le i \le n. \tag{1.4} \]

Here $H_0$ denotes the global intersection null hypothesis, and $H_1^{(n)}$ denotes
a specific element in its complement. Under $H_1^{(n)}$, a fraction $\varepsilon$ of the data
comes from a normal with common nonnull mean $\mu$. Here $\varepsilon = \varepsilon_n$ and the mean
$\mu = \mu_n$ will be chosen to make the problem very hard, but (just barely) still
solvable.

Obviously, in this situation, with $\varepsilon_n$ and $\mu_n$ fixed and known, the
optimal procedure is simply the likelihood ratio test; a careful analysis of its
performance in [23] (cf. also [16] and [17]) tells us the following. Suppose we
let $\varepsilon_n = n^{-\beta}$ for some exponent $\beta \in (1/2, 1)$, so that the fraction of nonzero
means is small but not vanishingly small. In this range the number of nonzero
means is too small to be noticeable in any sum which is, in expectation, of
order n, so it cannot noticeably affect the behavior of the bulk distribution
of the values. Let

\[ \mu_n = \sqrt{2 r \log(n)}, \qquad 0 < r < 1. \tag{1.5} \]

As $\mu_n < \sqrt{2\log(n)}$, the nonzero means are, in expectation, smaller than the
largest $X_i$ coming from the true component null hypotheses, so the nonzero
means cannot have a visible effect on the upper extremes. Clearly, this is a
rather subtle testing problem.
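To make the calibration concrete, here is a minimal simulation sketch of the alternative (1.4) with sparsity $n^{-\beta}$ and amplitude $\sqrt{2 r \log n}$. The function name `sample_sparse_mixture`, the seed and the parameter values are illustrative choices of ours, not part of the paper.

```python
import math
import random

def sample_sparse_mixture(n, beta, r, rng):
    """Draw X_1..X_n i.i.d. from (1 - eps) N(0,1) + eps N(mu,1)."""
    eps = n ** (-beta)                      # fraction of nonzero means
    mu = math.sqrt(2.0 * r * math.log(n))   # common nonzero mean
    xs = []
    num_nonnull = 0
    for _ in range(n):
        if rng.random() < eps:
            xs.append(rng.gauss(mu, 1.0))
            num_nonnull += 1
        else:
            xs.append(rng.gauss(0.0, 1.0))
    return xs, eps, mu, num_nonnull

rng = random.Random(0)
xs, eps, mu, k = sample_sparse_mixture(10_000, beta=0.6, r=0.3, rng=rng)
print(f"eps={eps:.5f}  mu={mu:.3f}  nonnull draws={k}")
```

With these values the expected number of nonnull observations is about 40 out of 10,000, and the common mean stays below $\sqrt{2\log n}$, so neither the bulk nor the maximum is expected to change visibly.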

It turns out that there is a threshold effect for the likelihood ratio test:
the sum of Type I and Type II errors tends to 0 or 1 depending on whether
$\mu$ exceeds a so-called detection boundary or not. In detail, there is a function
$\rho^*(\beta)$ so that

if $r > \rho^*(\beta)$, $H_0$ and $H_1^{(n)}$ separate asymptotically,
if $r < \rho^*(\beta)$, $H_0$ and $H_1^{(n)}$ merge asymptotically.

In short, $\rho^*(\beta)$ defines a precise demarcation between what is possible and
impossible in this problem, that is, how big the nonzero effect must be to be
detectable as a function of the rarity of nonzero effects. Hence, we have the
term detection boundary. Indeed, translating results of Ingster [17] to our
notation (see also [23]),

\[ \rho^*(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1. \end{cases} \tag{1.6} \]
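For readers who want to check where a given pair $(r, \beta)$ falls, the following sketch encodes the Gaussian detection boundary in its standard form: $\beta - 1/2$ for $1/2 < \beta \le 3/4$ and $(1 - \sqrt{1 - \beta})^2$ for $3/4 < \beta < 1$. The helper names `rho_star` and `is_detectable` are hypothetical.

```python
import math

def rho_star(beta):
    """Detection boundary for the sparse normal means problem, 1/2 < beta < 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def is_detectable(r, beta):
    """True when r lies strictly above the boundary (interior of the region)."""
    return r > rho_star(beta)

for beta in (0.55, 0.65, 0.75, 0.85, 0.95):
    print(f"beta={beta:.2f}  rho*={rho_star(beta):.4f}")
```

The two branches agree at $\beta = 3/4$, where both give $1/4$, so the boundary is continuous.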



If we think of the $(r, \beta)$ plane, $0 < r < 1$, $1/2 < \beta < 1$, we are saying that,
throughout the region $r > \rho^*(\beta)$ the alternative can be detected reliably
using the likelihood ratio test (LRT). Unfortunately, the usual
(Neyman-Pearson) likelihood ratio requires a precise specification of $r$ and
$\beta$, and misspecification of $(r, \beta)$ may lead to failure of the LRT; see [23]
for a discussion. Naturally, in any practical situation we would like to have a
procedure which does well throughout this whole region without knowledge of
$(r, \beta)$.

Bickel and Chernoff [6] and Hartigan [14] have shown that the usual
generalized likelihood ratio test $\max_{\varepsilon,\mu} [dP_1^{(n)}(\varepsilon,\mu)/dP_0^{(n)}](X)$ has nonstandard
behavior in this setting; in fact the maximized ratio tends to $\infty$ under $H_0$.
It is not clear that this test can be relied on to detect subtle departures from
$H_0$. Ingster [18] has proposed an alternative method of adaptive detection
which maximizes the likelihood ratio over a finite but growing list of simple
alternative hypotheses. By careful asymptotic analysis, he has, in principle,
completely solved the problem of adaptive detection in this setting; however,
this is a relatively complex and delicate procedure which is tightly tied to
the narrowly specified model (1.3) and (1.4). It would be nice to have an
easily implemented and intuitive method of detection which is able to work
effectively throughout the whole region $1/2 < \beta < 1$, $r > \rho^*(\beta)$, which is not
tied to the narrow model (1.3) and (1.4), and which is in some sense easily
adapted to other (non-Gaussian) mixture models. This is where HC* comes
in.



1.2. Performance of higher criticism. To apply the higher criticism, let
us convert the individual test statistics into another form. Let
$p_i = P\{N(0, 1) > X_i\}$ be the p-value for the ith component null hypothesis,
and let the $p_{(i)}$ denote the p-values sorted in increasing order, so that under
the intersection null hypothesis the $p_{(i)}$ behave like order statistics from a
uniform distribution.

With this notation, we can write

\[ \mathrm{HC}^*_n = \max_{1 \le i \le \alpha_0 \cdot n} \sqrt{n}\,[i/n - p_{(i)}]/\sqrt{p_{(i)}(1 - p_{(i)})}. \]

Despite the closeness of our statistic to the one in [2] , note that what we are 
doing is not arbitrary goodness of fit; instead we are dealing with a specific 
kind of multiple hypothesis testing, outside of which the problem would not 
be interesting. 
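The order-statistic form of the statistic translates directly into a few lines of code. The sketch below is an illustration under our own assumptions: the default $\alpha_0 = 1/2$, the function names, and the decision to skip p-values equal to 0 or 1 (where the standardization is undefined) are choices made here, not prescriptions from the text.

```python
import math

def one_sided_p_values(xs):
    """p_i = P{N(0,1) > X_i}, computed with erfc for accuracy in the upper tail."""
    return [0.5 * math.erfc(x / math.sqrt(2.0)) for x in xs]

def higher_criticism(p_values, alpha0=0.5):
    """max over 1 <= i <= alpha0*n of sqrt(n) (i/n - p_(i)) / sqrt(p_(i)(1 - p_(i)))."""
    n = len(p_values)
    p_sorted = sorted(p_values)
    best = -math.inf
    for i in range(1, int(alpha0 * n) + 1):
        p = p_sorted[i - 1]
        if 0.0 < p < 1.0:  # skip degenerate p-values (our convention)
            z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
            best = max(best, z)
    return best

hc = higher_criticism([0.01, 0.2, 0.5, 0.9])
print(round(hc, 4))
```

In this toy input only $i = 1, 2$ enter the maximum; the term for $i = 1$ is $2 \times 0.24/\sqrt{0.0099} \approx 4.82$ and the term for $i = 2$ is $1.5$, so the smallest p-value drives the statistic.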

To use HC* to conduct a level-$\alpha$ test, we must find a critical value $h(n, \alpha)$:

\[ P_{H_0}\{\mathrm{HC}^*_n > h(n, \alpha)\} \le \alpha. \]

Adapting asymptotic theory for the normalized empirical process as in
[27], Chapter 16, gives us the following information on the size of $h(n, \alpha)$:

Theorem 1.1. Under the null hypothesis $H_0$,

\[ \frac{\mathrm{HC}^*_n}{\sqrt{2 \log\log(n)}} \to_p 1, \qquad n \to \infty. \tag{1.7} \]

It follows that, for fixed $\alpha > 0$, $h(n, \alpha) \approx \sqrt{2 \log\log(n)}$. For asymptotic
analysis, it is convenient to consider a sequence of problems indexed by n,
with levels $\alpha_n \to 0$. We will say that the level $\alpha_n \to 0$ slowly enough
if

\[ h(n, \alpha_n) = \sqrt{2 \log\log(n)}\,(1 + o(1)). \]
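In finite samples one may prefer to estimate the critical value by simulation rather than rely on the asymptotic formula. The following sketch draws uniform p-values under the null, computes the statistic repeatedly and reads off an empirical quantile. The sample size, number of replications, seed and helper names are all illustrative assumptions.

```python
import math
import random

def higher_criticism(p_values, alpha0=0.5):
    """Order-statistic form of HC*, skipping degenerate p-values."""
    n = len(p_values)
    p_sorted = sorted(p_values)
    best = -math.inf
    for i in range(1, int(alpha0 * n) + 1):
        p = p_sorted[i - 1]
        if 0.0 < p < 1.0:
            best = max(best, math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p)))
    return best

def null_critical_value(n, alpha, reps, rng, alpha0=0.5):
    """Empirical (1 - alpha) quantile of HC* when all p-values are i.i.d. U(0,1)."""
    sims = sorted(
        higher_criticism([rng.random() for _ in range(n)], alpha0)
        for _ in range(reps)
    )
    index = min(reps - 1, math.ceil((1.0 - alpha) * reps) - 1)
    return sims[index]

rng = random.Random(1)
n = 500
h_mc = null_critical_value(n, alpha=0.05, reps=200, rng=rng)
h_asym = math.sqrt(2.0 * math.log(math.log(n)))
print(f"simulated h(n, 0.05) = {h_mc:.3f}, sqrt(2 log log n) = {h_asym:.3f}")
```

The comparison is only illustrative: the asymptotic value grows extremely slowly with n, so a simulated quantile is the safer reference at moderate sample sizes.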

Theorem 1.2. Consider the higher criticism test that rejects $H_0$ when

\[ \mathrm{HC}^*_n > h(n, \alpha_n), \]

where the level $\alpha_n \to 0$ slowly enough. For every alternative $H_1^{(n)}$ defined
above where r exceeds the detection boundary $\rho^*(\beta)$, so that the likelihood
ratio test would have full power, higher criticism also has full power:

\[ P_{H_1^{(n)}}\{\text{Reject } H_0\} \to 1, \qquad n \to \infty. \]

Roughly speaking, everywhere in the amplitude/sparsity $(r, \beta)$ plane
where the likelihood ratio test would completely separate the two hypotheses
asymptotically, the higher criticism will also completely separate the
two hypotheses asymptotically. Of course, in the cases where the
amplitude/sparsity relation falls below the detection boundary, all methods fail.
More precisely, we only claim that higher criticism works in the interior of
this region. Just at the critical point where $r = \rho^*(1 + o(1))$, our result says
nothing; this would be an interesting (but very challenging) area for future
work.

1.3. Which part of the sample contains the information? Underlying our
results is a set of insights about "where to look" for evidence against $H_0$ [30],
how the evidence may not be in the "obvious" place and how the adaptation
in HC* automatically ensures that the best evidence will be included in
making the "case" against $H_0$.

To get started, note that, in the null case HC* is closely related to
well-known functionals of the standard uniform empirical process. This is
because, under $H_0$, the n p-values can be viewed as i.i.d. samples from $U(0, 1)$.

Formalizing, given n independent random samples $U_1, \ldots, U_n$ from $U(0, 1)$,
with empirical distribution function

\[ F_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{U_i \le t\}}, \]

the uniform empirical process is denoted by

\[ U_n(t) = \sqrt{n}\,[F_n(t) - t], \qquad 0 < t < 1, \]

and the normalized uniform empirical process by

\[ W_n(t) = \frac{U_n(t)}{\sqrt{t(1 - t)}}. \]

Note that $W_n(t)$ is asymptotically $N(0, 1)$ for each fixed $t \in (0, 1)$.

Since, under $H_0$, the p-values are i.i.d. $U(0, 1)$, we have, in that case, the
representation

\[ \mathrm{HC}^*_n = \max_{0 < t < \alpha_0} W_n(t); \]

note the use of max instead of sup, since the maximum value of $W_n(t)$ is
attained.

We will extend usage below by letting $W_n(t)$ also stand for the normalized
empirical process starting from $F_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{p_i \le t\}}$, where the $p_i$ are the
n given p-values, but which are i.i.d. $U(0, 1)$ only in the null case.
Accordingly, $W_n(t)$ will be asymptotically $N(0, 1)$ at each fixed t only under $H_0$,
and one anticipates a different limiting behavior under the alternative.
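The normalized empirical process built from observed p-values can be evaluated pointwise as in the sketch below; evaluating it at the sorted p-values themselves recovers the individual terms of the order-statistic form of HC*. The function name is ours and the input is a toy example.

```python
import math

def normalized_empirical_process(p_values, t):
    """W_n(t) = sqrt(n) (F_n(t) - t) / sqrt(t (1 - t)) for 0 < t < 1."""
    n = len(p_values)
    f_n = sum(1 for p in p_values if p <= t) / n  # empirical distribution function
    return math.sqrt(n) * (f_n - t) / math.sqrt(t * (1.0 - t))

p_values = [0.01, 0.2, 0.5, 0.9]
w_small = normalized_empirical_process(p_values, 0.01)  # F_n = 1/4 at the smallest p-value
w_second = normalized_empirical_process(p_values, 0.2)  # F_n = 2/4 at the second smallest
print(round(w_small, 4), round(w_second, 4))
```

Because $F_n$ only jumps at observed p-values and the standardized difference decreases in t between jumps, it is enough to evaluate $W_n$ at those jump points when searching for the maximum.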

We now look for the value of t at which $W_n(t)$ has the most dramatically
different behavior under the null and the alternative.

We introduce the notation $z_n(q) = \sqrt{2 q \log(n)}$, for $0 < q \le 1$, and look
for the value of q for which $\mathrm{Prob}\{X_i > z_n(q)\}$ best differentiates between
null and alternative. Recall that, under the alternative $H_1^{(n)}$, the data have a
sparse component with nonzero mean at $\mu_n = z_n(r)$. It might seem that the
most informative part of the sample would be in the vicinity of $\mu_n$, where
data from the alternative are most common, and that therefore the most
informative value of q is q = r. Surprisingly, this is not the case.

To find this most informative q, we introduce some notation. Let
$p_{n,q} = P\{N(0, 1) > z_n(q)\}$, $q > 0$, and note that

\[ \max_{0 < t < 1/2} W_n(t) = \max_{0 < q < \infty} W_n(p_{n,q}). \]

It will be immediately clear that it is only necessary to consider $0 < q \le 1$.
Let $N_n(q)$ count the observations exceeding $z_n(q)$:

\[ N_n(q) = \#\{i : X_i > z_n(q)\}. \]

Then define

\[ V_n(q) = \frac{N_n(q) - n p_{n,q}}{\sqrt{n p_{n,q}(1 - p_{n,q})}}. \]

In many calculations, we will find factors with polylog behavior,
$\approx \mathrm{Const} \cdot \log^{a}(n)$, where a may be positive or negative depending on the
case. When such factors are multiplying terms $n^{\gamma}$ for $\gamma \ne 0$, the polylog
factors have a weak influence on the eventual growth rate of the resulting
product. To focus on the main ideas, we introduce the notation $L_n$ for a
generic polylog factor, which may change from occurrence to occurrence. When
we do this, we think of $L_n$ essentially as if it were constant.

Now

\[ P\{N(\mu_n, 1) > z_n(q)\} = L_n \cdot n^{-(\sqrt{q} - \sqrt{r})^2}, \qquad r < q \le 1, \]

\[ P\{N(0, 1) > z_n(q)\} = L_n \cdot n^{-q}, \qquad r < q \le 1. \]

It follows that, under the alternative $H_1^{(n)}$, we have

\[ E V_n(q) = L_n \cdot n^{(1+q)/2 - \beta - (\sqrt{q} - \sqrt{r})^2}, \]

while under the null $E V_n(q) = 0$. The most informative value of q will
optimize the growth rate of $n^{[(1+q)/2 - \beta - (\sqrt{q} - \sqrt{r})^2]}$ to $\infty$. The most informative
value of q satisfies

if $r < 1/4$, then $q = 4r$ and $E V_n(q) = L_n \cdot n^{r - (\beta - 1/2)}$,

if $r \ge 1/4$, then $q = 1$ and $E V_n(q) = L_n \cdot n^{1 - \beta - (1 - \sqrt{r})^2}$.

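In fact, there can be considerable latitude in choosing q so that $E V_n(q)$
goes to $\infty$ under $H_1^{(n)}$. This requires a positive exponent
$(1+q)/2 - \beta - (\sqrt{q} - \sqrt{r})^2 > 0$, that is,

\[ 2\sqrt{r} - \sqrt{2(r - \beta + 1/2)} < \sqrt{q} < 2\sqrt{r} + \sqrt{2(r - \beta + 1/2)}, \qquad 0 < q \le 1. \]

Notice that $r > \rho^*(\beta)$ implies $r - \beta + 1/2 > 0$. Within this interval the "most
informative choice" of q is the center: $\sqrt{q} = 2\sqrt{r}$ in case $r < 1/4$.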

Translating this into the original z-scale yields the surprise mentioned
earlier. When $r < 1/4$, the most informative place on the original z-scale is not
at $\mu_n$ as we might suppose, but at $2\mu_n$. By going out "in the tails" farther
than $\mu_n$, observations from both $H_0$ and $H_1^{(n)}$ are becoming extremely rare,
but the ones from $H_1^{(n)}$ are far more frequent in a relative sense.

The story when $r > 1/4$ is somewhat different. There, when looking for a
discrepancy, we had best look near $\sqrt{2\log(n)}$, which is less than $2\mu_n$. The
point is that observations from $H_0$ almost never get substantially larger than
$\sqrt{2\log(n)}$, so there is no need to look much farther out in the tails.
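The claim about the most informative threshold can be checked numerically. Ignoring polylog factors, the growth exponent of $E V_n(q)$ under the alternative is $(1+q)/2 - \beta - (\sqrt{q} - \sqrt{r})^2$; the sketch below maximizes it over a grid of q in (0, 1] and compares the result with the closed form $q = \min(4r, 1)$. Function names and the grid resolution are illustrative assumptions.

```python
import math

def growth_exponent(q, r, beta):
    """Exponent of n in E V_n(q) under the alternative, polylog factors ignored."""
    return (1.0 + q) / 2.0 - beta - (math.sqrt(q) - math.sqrt(r)) ** 2

def most_informative_q(r):
    """Closed form: q = 4r when r < 1/4, otherwise q = 1."""
    return min(4.0 * r, 1.0)

def grid_argmax_q(r, beta, steps=10_000):
    """Brute-force maximizer of the growth exponent over q in (0, 1]."""
    grid = [k / steps for k in range(1, steps + 1)]
    return max(grid, key=lambda q: growth_exponent(q, r, beta))

for r, beta in ((0.1, 0.55), (0.5, 0.8)):
    q_closed = most_informative_q(r)
    q_grid = grid_argmax_q(r, beta)
    print(r, beta, q_closed, q_grid, round(growth_exponent(q_closed, r, beta), 4))
```

For r = 0.1 the maximizer is q = 0.4, which corresponds to a threshold of $2\mu_n$ on the z-scale, while for r = 0.5 the exponent is increasing on (0, 1] and the maximizer sits at q = 1.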



1.4. Comparison to several multiple comparison procedures. The higher
criticism is just one specific approach to combining many p-values in search
of an overall test; many other tools are available from the field of multiple
comparisons (e.g., [15], [26] and [33]) and meta-analysis [3]. How do these
other tools perform?

In this section we describe several specific procedures and their range of 
detectability. See Figure 1. 



HIGHER CRITICISM FOR DETECTING MIXTURES 




1.4.1. Range/Maximum/Bonferroni. One of the most classical and
frequently used tools in multiple comparisons, also associated with Tukey [33],
is the studentized range:

\[ R_n = (\max_i X_i - \min_i X_i)/S_n, \]

where $S_n$ is the sample standard deviation. This is frequently used in testing
for homogeneity of a set of normal means, and could well be used in the
setting we have defined here. For our theoretical purposes, it is convenient
to analyze the simpler statistic

\[ M_n = \max_i X_i, \]

where we focus attention on one-sided deviations only and have no need
to estimate the (known) standard deviation under the null. (We note that
in the field of meta-analysis this is the same thing as combining several
p-values by taking the minimum p-value [3]; another equivalent term for this
is Bonferroni-based inference.) The maximum statistic has a critical value
$m(n, \alpha)$ which obeys

\[ m(n, \alpha) \sim \sqrt{2\log(n)}, \qquad n \to \infty. \]

In comparison to HC*, it follows that the test focuses entirely on whether
there are observations exceeding $\sqrt{2\log(n)}$. We again say that $\alpha_n$ goes to
0 slowly enough (now for use with $M_n$) if

\[ m(n, \alpha_n) \sim \sqrt{2\log(n)}. \]

The following result summarizes the behavior of $M_n$.




Fig. 1. Left: three regions of the $\beta$-$r$ plane. The detection boundary separates the
detectable region from the undetectable region. For the estimable region, it is possible not only
to detect the presence of nonzero means, but also to estimate those means. Right: two
detection boundaries. The one on the bottom is the optimal detection boundary as well as the
detection boundary for HC*; the one on the top is for range/maximum/Bonferroni/FDR.
Two detection boundaries are only different when $1/2 < \beta < 3/4$.

Theorem 1.3. Define

\[ \rho_{\mathrm{Max}}(\beta) = (1 - \sqrt{1 - \beta})^2. \]

Suppose $r > \rho_{\mathrm{Max}}(\beta)$ and consider a sequence of level-$\alpha_n$ tests based on $M_n$
with $\alpha_n \to 0$ slowly enough. Then

\[ P_{H_1^{(n)}}\{M_n \text{ rejects } H_0\} \to 1. \]

In short, $\rho_{\mathrm{Max}}$ defines the detection boundary for $M_n$. This compares to
the "efficient" boundary as follows:

\[ \rho^*(\beta) = \rho_{\mathrm{Max}}(\beta), \qquad \beta \in [3/4, 1), \]

so that $M_n$ is effective in the range of very sparse alternatives, while

\[ \rho^*(\beta) < \rho_{\mathrm{Max}}(\beta), \qquad \beta \in (1/2, 3/4), \]

so that $M_n$ is inefficient if $\beta < 3/4$. In particular, note that

\[ 0 = \rho^*(1/2) < \rho_{\mathrm{Max}}(1/2) = (2 - \sqrt{2})^2/4 \approx 0.0858. \]

We interpret this as saying that $M_n$ (and the Studentized range $R_n$ as well)
can be dramatically outperformed when there may be about $n^{1/2}$ nonzero
means among the n observations. Compare Figure 1.
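A short numerical comparison makes the gap between the two boundaries visible. The sketch below tabulates the boundary for the maximum, $(1 - \sqrt{1 - \beta})^2$, against the efficient boundary, using $\beta - 1/2$ on the moderately sparse range up to $\beta = 3/4$; the function names are ours, and the value at $\beta = 1/2$ is included only as the limiting case quoted in the text.

```python
import math

def rho_max(beta):
    """Detection boundary of the maximum statistic M_n."""
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def rho_star(beta):
    """Efficient boundary: beta - 1/2 up to beta = 3/4, then equal to rho_max."""
    return beta - 0.5 if beta <= 0.75 else rho_max(beta)

for beta in (0.5, 0.6, 0.7, 0.75, 0.9):
    gap = rho_max(beta) - rho_star(beta)
    print(f"beta={beta:.2f}  rho*={rho_star(beta):.4f}  rho_Max={rho_max(beta):.4f}  gap={gap:.4f}")
```

The gap is largest near $\beta = 1/2$, where the maximum needs $r$ above roughly 0.0858 while the efficient boundary is at 0, and it closes at $\beta = 3/4$.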

1.4.2. FDR-controlling methods. Recently considerable interest has been
focused on the false discovery rate (FDR)-controlling methodology for
simultaneous inference. In one example of this so-called FDR approach [4], one
considers, for k = 1, 2, 3, ..., the k most significant p-values. These are
compared with $\alpha k/n$, where $\alpha$ is a critical value (e.g., 0.05). If the p-values in the
group are all larger than the standard for comparison, then $H_0$ is accepted
at that stage. If some are smaller than the standard for comparison, then
$H_1^{(n)}$ is accepted, no further k are considered, and that specific group of k
hypotheses is identified as containing likely nonnull hypotheses. Viewed as a
hypothesis testing procedure for the intersection null hypothesis, Benjamini
and Hochberg [4] show that this procedure has level less than or equal to
$\alpha$. They also show that the procedure controls the FDR, which means, roughly
speaking, that, in expectation, a fraction of at least $(1 - \alpha)$ of the rejected
null hypotheses should be truly nonnull hypotheses.
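For reference, the widely used step-up formulation of the Benjamini-Hochberg rule can be sketched as follows: sort the p-values, find the largest k with $p_{(k)} \le \alpha k/n$, and reject the k most significant hypotheses. This is the standard textbook formulation rather than a transcription of the description above, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the (sorted) indices rejected by the BH step-up rule at level alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for k, idx in enumerate(order, start=1):
        if p_values[idx] <= k * alpha / n:
            k_max = k  # keep the largest k that meets its threshold
    return sorted(order[:k_max])

def rejects_intersection_null(p_values, alpha=0.05):
    """Use BH as a test of the global null: reject if anything is rejected."""
    return len(benjamini_hochberg(p_values, alpha)) > 0

print(benjamini_hochberg([0.001, 0.02, 0.04, 0.5]))        # thresholds 0.0125, 0.025, 0.0375, 0.05
print(rejects_intersection_null([0.2, 0.3, 0.4, 0.5]))
```

Used as a global test, the rule rejects exactly when some sorted p-value falls below its own threshold, which is why its detection behavior is tied to the most extreme observations.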

How does such a procedure behave in the current setting? We begin by
pointing out that Abramovich, Benjamini, Donoho and Johnstone [1] have
analyzed the behavior of FDR in exactly the kind of mixture model described
above in (1.3)-(1.4), but where $\mu_n$ is calibrated differently from (1.5), and
they found an asymptotic minimaxity of FDR in that setting. That is, they
considered a situation where one observed data $X_i = \theta_i + Z_i$ and where the
$Z_i$ are i.i.d. $N(0, 1)$. They supposed that only a fraction $\varepsilon_n$ of the means $\theta_i$
might be nonzero. This is similar to our model above.

As it turns out, for our measure of performance, the behavior of FDR-controlling
procedures is not different from that of the maximum $M_n$. To articulate this,
consider the detection boundary for the FDR-controlling procedure above.
This is the function $\rho_{\mathrm{FDR}}(\beta)$ such that, if $\mu_n = \sqrt{2 r \log(n)}$, and if we use a
sequence of levels $\alpha = \alpha_n$ tending to 0 slowly enough, then for $r > \rho_{\mathrm{FDR}}(\beta)$
the procedure has power tending to 1 as $n \to \infty$, while for $r < \rho_{\mathrm{FDR}}$ the
procedure has power tending to 0.

Theorem 1.4.

\[ \rho_{\mathrm{FDR}}(\beta) = \rho_{\mathrm{Max}}(\beta) = (1 - \sqrt{1 - \beta})^2, \qquad 1/2 < \beta < 1. \tag{1.8} \]

From our discussion of the behavior of the maximum $M_n$, the
FDR-controlling procedure is effective in the range $\beta \in [3/4, 1)$, while it is relatively
inefficient for $\beta < 3/4$. Compare Figure 1.

1.5. Classical methods for combining p-values. In the literature of
meta-analysis [3], one also faces the problem of combining several p-values to
achieve an overall test of significance. In that literature the component
nonnull hypotheses are all the same ("homogeneity"), whereas in our discussion
they vary widely ("heterogeneity"). A classical approach to combining
p-values is Fisher's method [13], which refers $F_n = -2 \sum_{1 \le i \le n} \log(p_i)$ to the
$\chi^2_{2n}$ distribution. In our setting of extreme heterogeneity, Fisher's method is
unable to function well asymptotically:

Theorem 1.5. If $\varepsilon_n = n^{-\beta}$, $\beta > 1/2$ and $\mu_n \le \sqrt{2\log(n)}$, asymptotically
$F_n$ is unable to separate $H_1^{(n)}$ and $H_0$.
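Fisher's combination statistic and its reference distribution are easy to compute without special libraries, because a chi-squared variable with an even number 2m of degrees of freedom has the closed-form tail $P(\chi^2_{2m} > x) = e^{-x/2} \sum_{k=0}^{m-1} (x/2)^k / k!$. The sketch below uses that identity in log-space for numerical stability; the function names are ours.

```python
import math

def fisher_statistic(p_values):
    """F_n = -2 * sum(log p_i); chi-squared with 2n degrees of freedom under the null."""
    return -2.0 * sum(math.log(p) for p in p_values)

def chi2_sf_even_df(x, half_df):
    """P(chi-squared with 2*half_df degrees of freedom > x), via the Poisson-sum identity."""
    if x <= 0.0:
        return 1.0
    lam = x / 2.0
    # log of each term exp(-lam) * lam**k / k!, summed with the log-sum-exp trick
    log_terms = [-lam + k * math.log(lam) - math.lgamma(k + 1) for k in range(half_df)]
    top = max(log_terms)
    return min(1.0, math.exp(top) * sum(math.exp(t - top) for t in log_terms))

p_values = [0.5, 0.5]
f_n = fisher_statistic(p_values)
combined_p = chi2_sf_even_df(f_n, half_df=len(p_values))
print(round(f_n, 4), round(combined_p, 4))
```

Because the statistic sums over all n p-values, a handful of moderately small p-values among many uniform ones changes the sum very little, which gives an informal sense of why the method struggles with sparse alternatives.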

1.6. Relation to goodness-of-fit testing. Of course, the method we are
discussing may be viewed as an application of a goodness-of-fit measure,
comparing the empirical distribution of p-values to the uniform distribution.
As such, it may be compared to many goodness-of-fit procedures where
the distribution under $H_1$ differs from that of $H_0$ in one tail.

Thus, Anderson and Darling [2] defined a goodness-of-fit measure which
involves the maximum of the normalized empirical process. Translated into
the current setting, this initially seems very close to HC*. The main
difference is that we focus attention near p = 0, while Anderson and Darling
maximize over $a < p < b$ with $0 < a < b < 1$. However (an important point) all
the information needed for discrimination between $H_0$ and $H_1^{(n)}$ is at
values of p increasingly close to 0 as n increases. Therefore, statistics based on
$a < p < b$ with a, b fixed are dramatically inefficient. Similarly, Borovkov and
Sycheva ([7] and [8]) proposed statistics based on the maximum of a
normalized empirical process with a general nonlinear normalization $g(p)$ not
necessarily $\sqrt{p(1 - p)}$. However, the maximum is over $a < p < b$, $0 < a < b < 1$,
so our remarks on the Anderson-Darling statistic apply.

Berk and Jones [5] proposed a goodness-of-fit measure which, adapted to
the present setting, may be written as

\[ \mathrm{BJ}^+_n = n \cdot \max_{1 \le i \le n/2} K^+(i/n, p_{(i)}), \tag{1.9} \]

where $K^+$ is defined by

\[ K^+(t, x) = \begin{cases} t \log\frac{t}{x} + (1 - t)\log\frac{1 - t}{1 - x}, & \text{if } 0 < x < t < 1, \\ 0, & \text{if } 0 \le t \le x \le 1, \\ -\infty, & \text{otherwise}, \end{cases} \]

and is motivated by large deviation theory. Roughly speaking, Lemma 6.4
below shows that $K^+(t, x)$ behaves as $\frac{1}{2}(t - x)^2/(x(1 - x))$, so it should not
be surprising that for our measure of performance the detection boundary
of $\mathrm{BJ}^+_n$ is the same as that of HC*. (Note, however, that full justification of
the asymptotic claim in [5] has only recently been provided; see [35] for a
thorough analysis, which also may shed light on the limiting distribution of
HC*.) To articulate our claim about the detection boundary of the
Berk-Jones method, define a function $\rho_{\mathrm{BJ}}(\beta)$ such that, if $\mu_n = \sqrt{2 r \log(n)}$, and
if we use a sequence of levels $\alpha = \alpha_n$ tending to 0 slowly enough, then for
$r > \rho_{\mathrm{BJ}}(\beta)$, the procedure has power tending to 1 as $n \to \infty$, while for $r < \rho_{\mathrm{BJ}}$
the procedure has power tending to 0.

Theorem 1.6. $\rho_{\mathrm{BJ}}(\beta) = \rho^*(\beta)$, $1/2 < \beta < 1$.

However, HC* is still better than $\mathrm{BJ}^+_n$ in important ways; we will discuss
this in the Appendix, where we prove Theorem 1.6.

1.7. Generalizations. As we have defined it, the higher criticism statistic
can obviously be used in a wide variety of situations and there is no need for
the p-values to be derived from normally distributed Z-scores, for example.
Consequently, numerous other settings for its deployment can be considered.
We have found that, in a wide variety of settings where one has data which
are "sparsely nonnull," the HC* statistic has an adaptive optimality.

To give the flavor of one of these, we consider a model deriving from
the covert communications example mentioned earlier. The problem is one
of noncooperative spread-spectrum signal detection. Here one observes n
periodogram ordinates $X_i$. In the "covert data absent" case, these represent
periodogram ordinates of a white noise, while in the "covert data present"
case a small fraction of periodogram ordinates are inflated by the presence
of covert signal. The formal model takes the form:

\[ H_0: X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(2), \qquad 1 \le i \le n, \]

\[ H_1^{(n)}: X_i \overset{\text{i.i.d.}}{\sim} (1 - \varepsilon)\,\mathrm{Exp}(2) + \varepsilon\,\chi^2_2(\delta), \qquad 1 \le i \le n. \]

[Here Exp(2) denotes the exponential distribution with mean 2 and $\chi^2_2(\delta)$
denotes the noncentral chi-squared distribution with noncentrality
parameter $\delta$.] Here the data are non-Gaussian, n is large, and the sparsity parameter
$\varepsilon = \varepsilon_n = n^{-\beta}$ as before. The strength of the covert signal is measured by the
noncentrality parameter $\delta$ of the chi-squared distribution, which we take as
$\delta = \delta_n = 2 r \log(n)$ for an underlying amplitude parameter r having much the
same interpretation as before. (In a cooperative signal detection setting we
would know a priori which of the coordinates $X_i$ will exhibit the presence
of the covert signal; in the noncooperative case we would not know this.)

We can again apply the principle of higher criticism in this setting,
defining p-values through the component null hypotheses:

\[ p_i = \mathrm{Prob}\{\mathrm{Exp}(2) > X_i\}, \qquad i = 1, \ldots, n. \]

We can also define the detection boundary for this test, as before. This is
the function $\rho_{\mathrm{HC,Exp}}(\beta)$ such that, if $\delta_n = 2 r \log(n)$, and if we use a sequence
of levels $\alpha = \alpha_n$ tending to 0 slowly enough, then, for $r > \rho_{\mathrm{HC,Exp}}(\beta)$, the
procedure has power tending to 1 as $n \to \infty$, while, for $r < \rho_{\mathrm{HC,Exp}}$, the
procedure has power tending to 0. We can also define the intrinsic detection
boundary, the function $\rho^*_{\mathrm{Exp}}(\beta)$ such that if $r < \rho^*_{\mathrm{Exp}}$ the two hypotheses
merge asymptotically, while if $r > \rho^*_{\mathrm{Exp}}$ the two hypotheses separate
asymptotically.
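The exponential model is also simple to simulate. In the sketch below, the null p-value is $\exp(-x/2)$ because Exp(2) has mean 2, and a noncentral chi-squared variable with 2 degrees of freedom is generated as the sum of squares of two unit-variance normals whose means have squared length $\delta$. That construction follows the usual convention for the noncentrality parameter, which is an assumption on our part, as are the function names and parameter values.

```python
import math
import random

def exp2_p_value(x):
    """Prob{Exp(2) > x} for the exponential distribution with mean 2."""
    return math.exp(-x / 2.0) if x > 0.0 else 1.0

def sample_noncentral_chi2_2df(delta, rng):
    """Noncentral chi-squared, 2 degrees of freedom, noncentrality delta."""
    z1 = rng.gauss(math.sqrt(delta), 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return z1 * z1 + z2 * z2

def sample_covert_model(n, beta, r, rng):
    """Draw n periodogram ordinates from (1 - eps) Exp(2) + eps chi2_2(delta)."""
    eps = n ** (-beta)
    delta = 2.0 * r * math.log(n)
    return [
        sample_noncentral_chi2_2df(delta, rng) if rng.random() < eps
        else rng.expovariate(0.5)  # rate 1/2 gives mean 2
        for _ in range(n)
    ]

rng = random.Random(2)
xs = sample_covert_model(5_000, beta=0.6, r=0.4, rng=rng)
p_values = [exp2_p_value(x) for x in xs]
print(min(p_values), max(p_values))
```

The resulting p-values can be passed to any implementation of the order-statistic form of HC* exactly as in the Gaussian case, since the statistic depends on the data only through the p-values.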



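Theorem 1.7.

\[ \rho_{\mathrm{HC,Exp}}(\beta) = \rho^*_{\mathrm{Exp}}(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1. \end{cases} \tag{1.10} \]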



In words, the higher criticism statistic achieves the optimal detection
region in the $(r, \beta)$ plane; interestingly, this region is the same as we had in
the Gaussian case. See Figure 3; also compare to Figure 1.

Other non-Gaussian settings are discussed in Sections 5.1 and 5.2. In
each case the higher criticism statistic achieves the (interior of) the optimal
detection region.

1.8. Contents of this paper. In this paper we establish the key results
referred to so far, and then we consider various generalizations. Section 2
develops Theorems 1.1 and 1.2. Section 3 discusses $\mathrm{HC}^+$ as a variant of
HC* but with better performance in finite samples. Section 4 describes some
simulation experiments. Section 5 considers numerous non-Gaussian settings
and shows consistently the superiority of higher criticism over Bonferroni
and other ideas from the field of multiple comparisons; the proof of Theorem
1.7 is also in this section. The Appendix provides proofs of Theorems 1.3-1.6
as well as Lemma 5.1.



2. Main results. Before continuing the narrative flow of the paper, we 
pause to prove our key results: Theorems 1.1 and 1.2. Later sections will 
return to the narrative format. 



2.1. Proof of Theorem 1.1. The idea behind Theorem 1.1 is to simply
apply known results from the theory of empirical processes. We recall the
normalized uniform empirical process $W_n(t)$ introduced in Section 1.3, and
remind the reader that, under $H_0$,

\[ \mathrm{HC}^*_n = \max_{0 < t < 1/2} W_n(t). \]

The normalized empirical process has been studied carefully by a number of
authors, and a summary of results can be found in [27], Chapter 16. There,
in (16.20), they show that

\[ \frac{\max_{0 < t < \alpha_0} W_n(t)}{\sqrt{2 \log\log(n)}} \to_p 1, \qquad n \to \infty, \tag{2.11} \]

by an argument that depends on the work of Jaeschke [22]. This, in turn,
depends on the approximation of $W_n$ by a Brownian bridge and on a result
of Darling and Erdős [11], which says that, if $B(t)$ is standard Brownian
motion starting at $B(0) = 0$, then

\[ \sup_{t \in [1, n]} \frac{B(t)}{\sqrt{t}} \cdot \frac{1}{\sqrt{2 \log\log n}} \to_p 1, \qquad n \to \infty. \]

Of course, (2.11) implies Theorem 1.1. □



2.2. Proof of Theorem 1.2. We begin with a simple observation. 



Lemma 2.1. Let $X_1, \ldots, X_n$ be i.i.d. Bernoulli($\pi_n$) and let $a_1, \ldots, a_n$ be a
sequence of real numbers. If $n\pi_n \to \infty$ and $a_n/\sqrt{n\pi_n} \to \infty$, then
$\lim_{n \to \infty} P(\sum_{i=1}^{n} (X_i - \pi_n) < -a_n) = 0$.

Proof. Since $\mathrm{Var}(\sum_{i=1}^{n} X_i) \le n\pi_n$, use of Chebyshev's inequality yields

\[ P\Big\{\sum_{i=1}^{n} (X_i - \pi_n) < -a_n\Big\} \le \frac{n\pi_n}{a_n^2} \to 0. \]

To prove Theorem 1.2, we note that it is enough to show

\[ \lim_{n \to \infty} P_{H_1^{(n)}}\{\mathrm{HC}^*_n < \sqrt{4 \log\log(n)}\} = 0. \tag{2.1} \]

Now, recalling the definition of Vn{q), we have 

HG:> sup Vn{q). 

0<g<l 

For < g < 1, recall that pn^q = P{A^(0, 1) > Zn{q)}] we also put 

= ^{(1 - en)A^(0, 1) + enA^(/in, l)^n(g)}. 

We now consider two cases. First, suppose that r > p*{f3) and r > |; then 
the hypothesis is detectable, if at all, merely by looking at the maximum 
of the Xi. Note that y/r + ^/l — (3> 1. Now, as V^(l) < supo<g<i Vniq) and 

K(l) =D (^n(l) - npn,l)/^Pn,l{^-Pn,l), 

P in) \ sup Vn{q) < V41oglog(n) 
1 10<(?<1 

< P^(„){K(1) < V41oglog(n)} 



< P^(n){Nn{l) < y/npn,W 4:loglog{n) +npn,i}- 



Under $H_1^{(n)}$, $N_n(1)$ is a sum of $n$ independent Bernoulli($p'_{n,1}$), and by direct calculations

$$p_{n,1} = o(1/n), \qquad p'_{n,1} = \frac{1}{\sqrt{4\pi\log(n)}\,(1-\sqrt{r})}\, n^{-\beta-(1-\sqrt{r})^2}(1+o(1)).$$

With Lemma 2.1 in mind, let $\pi_n = p'_{n,1}$ and

$$a_n = np'_{n,1} - \bigl[np_{n,1} + \sqrt{np_{n,1}}\sqrt{4\log\log(n)}\bigr] = np'_{n,1}(1+o(1)),$$

so $n\pi_n\to\infty$ and $a_n/\sqrt{n\pi_n}\to\infty$; the desired result (2.1) in this case follows from Lemma 2.1. In the second case, suppose $r > \rho^*(\beta)$ and $r < \frac14$. It will turn out that the hypothesis is detectable based on HC* but not on the maximum of the $X_i$. Note that $\beta + r < 1$.

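$$P_{H_1^{(n)}}\Bigl\{\sup_{0<q\le1} V_n(q) < \sqrt{4\log\log(n)}\Bigr\} \le P_{H_1^{(n)}}\bigl\{V_n(4r) < \sqrt{4\log\log(n)}\bigr\} \le P_{H_1^{(n)}}\bigl\{N_n(4r) < \sqrt{np_{n,4r}}\sqrt{4\log\log(n)} + np_{n,4r}\bigr\}.$$

Under $H_1^{(n)}$, $N_n(4r)$ is a sum of $n$ independent Bernoulli($p'_{n,4r}$), and

$$p_{n,4r} = \frac{1}{4\sqrt{\pi r\log(n)}}\, n^{-4r}(1+o(1)),$$

$$p'_{n,4r} = p_{n,4r} + \frac{1}{\sqrt{4\pi r\log(n)}}\, n^{-(\beta+r)}(1+o(1)).$$

With Lemma 2.1 in mind, let $\pi_n = p'_{n,4r}$ and

$$a_n = n\bigl[p'_{n,4r} - p_{n,4r}\bigr] - \sqrt{np_{n,4r}}\sqrt{4\log\log(n)}.$$

Direct calculation shows

$$\frac{a_n}{\sqrt{n\pi_n}} = \begin{cases} O\bigl(n^{r-(\beta-1/2)}/(\log(n))^{1/4}\bigr), & \text{if } \beta \ge 3r,\\ O\bigl(n^{\frac12[1-(\beta+r)]}/(\log(n))^{1/4}\bigr), & \text{if } \beta < 3r,\end{cases}$$

so $n\pi_n\to\infty$, $a_n/\sqrt{n\pi_n}\to\infty$; the desired result (2.1) follows for this case from Lemma 2.1. □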

3. A refinement. In the most challenging cases, where $0 < r < \frac14$, the analysis above tells us that the real information about the presence of $H_1^{(n)}$ is located away from the extreme values, and so, perhaps unexpectedly, the few smallest $p_i$ are not relevant to the detection problem.

This suggests that we will still be able to reach the full interior of the detection region if we work with a modified statistic involving the maximum over all p-values greater than or equal to $1/n$, which is

$$\mathrm{HC}_n^+ = \max_{1\le i\le n/2,\; p_{(i)}\ge 1/n} \frac{\sqrt{n}\,[i/n - p_{(i)}]}{\sqrt{p_{(i)}(1-p_{(i)})}}.$$

Of course, this restriction is not necessary from the viewpoint of asymptotic 
analysis, since our results show that HC* is effective without any adjustment. 
However, our experience with moderate-sized samples shows this adjustment 
to be quite valuable. 
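
A minimal Python sketch of the two statistics may help fix ideas. It assumes the definition of HC* from Section 1 with $\alpha_0 = 1/2$, that is, the maximum over $1 \le i \le n/2$ of $\sqrt{n}\,[i/n - p_{(i)}]/\sqrt{p_{(i)}(1-p_{(i)})}$; the function name `hc_statistics` and the clipping of p-values inside the denominator are illustrative choices and not part of the text.

```python
import numpy as np

def hc_statistics(pvalues, alpha0=0.5):
    """Return (HC*, HC+) for a 1-D array of at least two p-values (illustrative sketch)."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    i = np.arange(1, n + 1)
    # Clip only inside the denominator so that p = 0 or p = 1 cannot divide by zero.
    p_den = np.clip(p, np.finfo(float).tiny, np.nextafter(1.0, 0.0))
    z = np.sqrt(n) * (i / n - p) / np.sqrt(p_den * (1.0 - p_den))
    in_range = i <= alpha0 * n                  # 1 <= i <= alpha0 * n
    hc_star = z[in_range].max()
    in_range_plus = in_range & (p >= 1.0 / n)   # additionally require p_(i) >= 1/n
    hc_plus = z[in_range_plus].max() if in_range_plus.any() else float("-inf")
    return hc_star, hc_plus
```

Because HC+ maximizes over a subset of the indices used by HC*, the second returned value can never exceed the first.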

The empirical process viewpoint is helpful. Under $H_0$, HC* behaves as

$$\sup_{0<t<1/2} W_n(t). \tag{3.1}$$

For comparison, HC+ behaves as

$$\sup_{1/n<t<1/2} W_n(t). \tag{3.2}$$

The limitation of the range in (3.2) is important. The quantity in (3.1) can be very much dominated by behavior in the vicinity of 0, or, more particularly, by the smallest observation. Since the smallest observation (smallest p-value) is not where the information for detection resides, this emphasis is misplaced.

Speaking quantitatively, the statistic HC* can have a heavy-tailed distribution under the null hypothesis, as we will show in a moment. Since, ordinarily, one does not want to use test statistics with heavy-tailed distributions (because tests at stringent levels would have low power in small samples), the quantitative motivation for HC+ is clear.

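The heavy tails under the null hypothesis come from the following effect. Recall that, under $H_0$, the minimum p-value $p_{(1)}$ has a distribution that is asymptotically Exp($1/n$), that is, exponential with mean $1/n$. It follows that the first component in the maximum defining HC*,

$$\mathrm{HC}_{n,1} = \sqrt{n}\,[1/n - p_{(1)}]\big/\sqrt{p_{(1)}(1-p_{(1)})},$$

has the asymptotic distribution

$$\mathrm{HC}_{n,1} \to_D \frac{1}{\sqrt{E}} - \sqrt{E},$$

where $E$ is exponentially distributed with mean 1. Now, for large $t$,

$$P\Bigl(\frac{1}{\sqrt{E}} - \sqrt{E} > t\Bigr) = P\Bigl(\sqrt{E} < \bigl(\sqrt{t^2+4} - t\bigr)/2\Bigr) \sim \frac{1}{t^2}.$$

Hence, $\mathrm{HC}_{n,1}$ has "heavy tails" under $H_0$.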
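
The tail of the limiting variable can be evaluated in closed form, since $1/\sqrt{E} - \sqrt{E} > t$ is equivalent to $E < s^2$ with $s = (\sqrt{t^2+4}-t)/2$. The short Python sketch below compares that exact probability with $1/t^2$; the function name `hc1_limit_tail` is hypothetical and introduced only for this illustration.

```python
import math

def hc1_limit_tail(t):
    """P{1/sqrt(E) - sqrt(E) > t} for E ~ Exp(1)."""
    # (sqrt(t^2 + 4) - t)/2 rewritten as 2/(sqrt(t^2 + 4) + t) to avoid cancellation.
    s = 2.0 / (math.sqrt(t * t + 4.0) + t)
    return -math.expm1(-s * s)        # P{E < s^2} = 1 - exp(-s^2)

for t in (5.0, 10.0, 100.0):
    # The ratio in the last column approaches 1 as t grows.
    print(t, hc1_limit_tail(t), hc1_limit_tail(t) * t * t)
```

A polynomial tail of order $t^{-2}$ means the limiting variable has no finite variance, which is the sense in which the first component is heavy-tailed.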

Numerical simulations show that unusually large values of HC* under the null hypothesis are most frequently caused by $\mathrm{HC}_{n,1}$, suggesting that if we restrict the range of the maximum as in HC+, this "heavy tail" is thinned out considerably. Numerical experiments support this analysis. Adapting material from [12], or from [27], page 600, it seems we have the following asymptotic law for HC+:

$$b_n\,\mathrm{HC}_n^+ - c_n \to_w E_v^2,$$

where

$$b_n = \sqrt{2\log\log(n)}, \qquad c_n = 2\log\log(n) + \tfrac12\log(\log\log(n)) - \tfrac12\log(4\pi),$$

and the c.d.f. of $E_v^2$ is $\exp(-2\exp(-x))$. Experiments show that this is a fairly accurate approximation for moderate $n$. The same form of limiting distribution holds for HC*, of course, but it is empirically a poor fit.
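
If one takes the limiting law above at face value (the text itself presents it tentatively), an approximate critical value for HC+ follows by inverting the c.d.f. $\exp(-2\exp(-x))$. The sketch below does this in Python; the function name `hc_plus_critical_value` is an assumption made for illustration, and the result is only as accurate as the asymptotic approximation.

```python
import math

def hc_plus_critical_value(n, alpha):
    """Approximate level-alpha critical value for HC+ from the limiting law (n > e)."""
    loglog = math.log(math.log(n))
    b_n = math.sqrt(2.0 * loglog)
    c_n = 2.0 * loglog + 0.5 * math.log(loglog) - 0.5 * math.log(4.0 * math.pi)
    # Solve exp(-2 exp(-x)) = 1 - alpha for x.
    x = -math.log(-0.5 * math.log1p(-alpha))
    # Reject when b_n * HC+ - c_n > x, i.e. when HC+ > (c_n + x) / b_n.
    return (c_n + x) / b_n
```

For example, `hc_plus_critical_value(10**6, 0.05)` gives the threshold above which HC+ would be declared significant at the nominal 5% level under this approximation.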
4. Simulation. We have conducted a small-scale empirical study of the performance of HC* and HC+. Our idea was to select a few interesting $(r,\beta)$ pairs in the detectable region (above, but close to, the boundary), to create samples from null and alternative, and to study the behavior of the proposed statistics.

We took $n = 10^6$, $\epsilon = \frac{1}{1000}$ and $\mu = \sqrt{2\cdot 0.15\cdot\log(n)} \approx 2.04$ for the following experiment:

1. Draw $n$ samples from $N(0,1)$ to represent $H_0$; calculate HC* and HC+.

2. Replace 1000 of the previous samples by the same number of samples from $N(\mu,1)$ to represent $H_1^{(n)}$; calculate HC* and HC+.

3. Repeat 1 and 2 100 times and make histograms of the simulated HC* and HC+.
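
The three steps above can be sketched in Python as follows. This is a scaled-down illustration rather than the authors' code: the defaults ($n = 10^4$, 100 replaced samples, 20 repetitions) are assumptions chosen so the sketch runs quickly, the helper names `normal_sf`, `hc_pair` and `simulate` are hypothetical, and HC* is again taken as the maximum over $1\le i\le n/2$.

```python
import math
import numpy as np

# Upper-tail probability of N(0,1), applied elementwise (stdlib erfc, no SciPy needed).
normal_sf = np.vectorize(lambda x: 0.5 * math.erfc(x / math.sqrt(2.0)), otypes=[float])

def hc_pair(pvalues, alpha0=0.5):
    """Return (HC*, HC+) for an array of p-values."""
    p = np.sort(pvalues)
    n = p.size
    i = np.arange(1, n + 1)
    p_den = np.clip(p, np.finfo(float).tiny, np.nextafter(1.0, 0.0))
    z = np.sqrt(n) * (i / n - p) / np.sqrt(p_den * (1.0 - p_den))
    keep = i <= alpha0 * n
    keep_plus = keep & (p >= 1.0 / n)
    hc_plus = z[keep_plus].max() if keep_plus.any() else float("-inf")
    return z[keep].max(), hc_plus

def simulate(n=10_000, n_signal=100, r=0.15, reps=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = math.sqrt(2.0 * r * math.log(n))
    null_stats, alt_stats = [], []
    for _ in range(reps):
        x = rng.standard_normal(n)                     # step 1: sample under H0
        null_stats.append(hc_pair(normal_sf(x)))
        x[:n_signal] = rng.normal(mu, 1.0, n_signal)   # step 2: replace n_signal samples
        alt_stats.append(hc_pair(normal_sf(x)))
    # Step 3: one row per repetition, columns (HC*, HC+); histogram these columns.
    return np.array(null_stats), np.array(alt_stats)
```

Calling `simulate(n=10**6, n_signal=1000, reps=100)` would match the sample sizes quoted in the text, at a correspondingly higher running time.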

See Figure 2 for the results. As can be seen from the theory above, detectability requires increasingly large samples as one approaches the detection boundary. In fact, depending on how close one goes to the boundary, the required sample sizes can become enormous. To have an idea of how large $n$ may have to be, we consider the case $r < \frac14$, $\frac12 < \beta < \frac34$. We recall that

$$EV_n(4r) = \begin{cases} 0, & \text{under } H_0,\\ n^{r-(\beta-1/2)}\big/\sqrt[4]{\pi r\log(n)}, & \text{under } H_1^{(n)}.\end{cases} \tag{4.1}$$

Numerically, we calculate the values of $EV_n(4r)$ under $H_1^{(n)}$ for various $n$ with $r = 0.1$, $\beta = \frac12$ and $r = 0.05$, $\beta = \frac12$, respectively; the results are summarized in Table 1, together with the values of $\sqrt{2\log\log(n)}$ for comparison; notice the variance of $V_n(4r)$ is roughly 1. This, of course, raises the issue of




Fig. 2. Histograms for HC* and HC+. Top row: Behavior under $H_0$. Bottom row: Behavior under $H_1^{(n)}$. Left column: HC*. Right column: HC+. Here $n = 10^6$, $\mu_n = \sqrt{0.3\log(n)} \approx 2.04$, $\epsilon = 10^{-3}$.

Table 1

$EV_n(4r)$ under $H_1^{(n)}$ for various $n$. The parameters $(r,\beta)$ are $r = 0.1$, $\beta = \frac12$ and $r = 0.05$, $\beta = \frac12$, respectively; values of $\sqrt{2\log\log(n)}$ are also included for comparison

n                                        10^6      10^7      10^8      10^9      10^10
sqrt(2 log log(n))                       2.2916    2.3579    2.4139    2.4622    2.5046
EV_n(4r),  r = 0.1,  beta = 1/2          2.7582    3.3411    4.0680    4.9728    6.0976
EV_n(4r),  r = 0.05, beta = 1/2          1.6439    1.7748    1.9259    2.0982    2.2931




whether our theory adequately describes practice in small samples and, in 
particular, whether the motivating examples of the Introduction really offer 
credible scenarios for deployment of these results. We leave such discussion 
to another occasion. 
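
The entries of Table 1 can be recomputed directly from (4.1). The Python sketch below assumes $\beta = \frac12$ for both rows, which is the value that reproduces the tabulated numbers; the function names `ev_4r` and `null_scale` are hypothetical.

```python
import math

def ev_4r(n, r, beta):
    """Right-hand side of (4.1) under the alternative."""
    return n ** (r - (beta - 0.5)) / (math.pi * r * math.log(n)) ** 0.25

def null_scale(n):
    """sqrt(2 log log n), the comparison row of Table 1."""
    return math.sqrt(2.0 * math.log(math.log(n)))

for k in range(6, 11):
    n = 10.0 ** k
    print(f"n = 1e{k}: {null_scale(n):.4f}  {ev_4r(n, 0.1, 0.5):.4f}  {ev_4r(n, 0.05, 0.5):.4f}")
```

The slow growth of the third column relative to the first illustrates how large $n$ must be before a signal with $r = 0.05$ stands clear of the null fluctuations.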

The large-sample issue also raises the question of how to efficiently sample 
from the normal distribution with very large n. Our approach goes as follows. 

1. Pick a small number e, such as e = 10~^ or e = 10~^. 

2. Simulate samples from the uniform distribution at quantiles greater than $1-\epsilon$.

• Sample a number $K$ from the Poisson distribution with mean $n\epsilon$.

• Generate $K$ samples $(U_1,\dots,U_K)$ from the uniform distribution on $(1-\epsilon, 1)$.

• Generate $K$ samples $(z_1,\dots,z_K)$ by letting

$$y_i = -2\log(1-U_i) - \log 2\pi, \qquad z_i = \sqrt{y_i - \log y_i}, \qquad 1\le i\le K. \tag{4.2}$$

Approximately, the $(z_1,\dots,z_K)$ can be viewed as if a sample of size $n$ had been taken from the Normal distribution, and then only the roughly $\epsilon n$ largest sample values were retained. Compared to brute force, the algorithm requires only $\epsilon n$ flops, rather than $n$ flops. Obviously, the accuracy of the approximation in (4.2) depends on how small $\epsilon$ is; the smaller the $\epsilon$, the more accurate the approximation, and thus the more accurate the simulation.
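
A minimal NumPy sketch of this tail sampler follows. It draws $1 - U_i$ directly as a uniform variable on $(0, \epsilon]$, which is equivalent to drawing $U_i$ on $[1-\epsilon, 1)$ and avoids taking the logarithm of zero; the function name `sample_normal_upper_tail` is hypothetical, and $\epsilon$ is assumed small enough that $y_i - \log y_i > 0$.

```python
import numpy as np

def sample_normal_upper_tail(n, eps, rng):
    """Approximate the values above the (1 - eps) quantile in a N(0,1) sample of size n."""
    k = rng.poisson(n * eps)                     # number of samples in the upper tail
    one_minus_u = eps * (1.0 - rng.random(k))    # 1 - U_i, uniform on (0, eps]
    y = -2.0 * np.log(one_minus_u) - np.log(2.0 * np.pi)
    return np.sqrt(y - np.log(y))                # equation (4.2)

rng = np.random.default_rng(0)
z = sample_normal_upper_tail(10**8, 1e-5, rng)   # about 1000 values instead of 10^8
```

The approximation rests on the tail relation $P\{N(0,1) > z\} \approx e^{-z^2/2}/(z\sqrt{2\pi})$, which is why it improves as $\epsilon$ shrinks and the retained values move further into the tail.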

5. Other settings. The principle of higher criticism can be applied in a 
much wider series of situations than the Gaussian model studied so far, as 
we now show. 

5.1. Chi-squared. Let $\chi^2_\nu(\delta)$ denote the usual chi-squared distribution with $\nu$ degrees of freedom and noncentrality parameter $\delta$. Consider the problem of testing between these hypotheses:

$$H_0: X_i \overset{\text{i.i.d.}}{\sim} \chi^2_\nu(0), \qquad 1\le i\le n, \tag{5.1}$$

$$H_1^{(n)}: X_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\chi^2_\nu(0) + \epsilon\chi^2_\nu(\delta), \qquad 1\le i\le n. \tag{5.2}$$

Here $\epsilon = \epsilon_n = n^{-\beta}$, as before. Owing to the representation of chi-squared r.v.'s as sums of squares of standard normals, the problem can be rewritten in terms of arrays of normals $z_{ij}$:

$$H_0: X_i \overset{\text{i.i.d.}}{\sim} \sum_{j=1}^{\nu} z_{ij}^2, \qquad 1\le i\le n, \tag{5.3}$$

$$H_1^{(n)}: X_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\sum_{j=1}^{\nu} z_{ij}^2 + \epsilon\sum_{j=1}^{\nu}(z_{ij}+\delta_{ij})^2, \qquad 1\le i\le n, \tag{5.4}$$

where $z_{ij} \overset{\text{i.i.d.}}{\sim} N(0,1)$ and the shifts satisfy $\sum_{j=1}^{\nu}\delta_{ij}^2 = \delta$. Roughly speaking, a small fraction of the normals may have nonzero mean, and we must base our decisions on sums of squares rather than on the normals themselves. The problem is obviously related to the Gaussian problem discussed above; only the observations with nonzero means occur in groups, and while the nonzero means are not equal within a group, the sums of squares of those means within a group must be equal.

Now, obviously, if $\nu = 1$, then we are equivalently dealing with squares of individual normals, and so are just seeing a two-sided variant of the original one-sided normal testing problem considered so far. As it turns out, all the detection boundary and attainability results for the two-sided normal problem are the same as in the one-sided case.

If $\nu = 2$, we view this as modeling the covert communications problem of Section 1.5. Indeed, the real and imaginary components of the discrete Fourier transform of Gaussian white noise are normal and independent; the sum of squares of those two components is just the periodogram. In a frequency where there is no signal, only Gaussian noise, the periodogram has a $\chi^2_2(0)$ [that is, Exp(2)] distribution, which is precisely the exponential mentioned earlier; while, in the signal-present case, the periodogram has a $\chi^2_2(\delta_k)$ distribution, where $\delta_k$ is the signal energy at that frequency.

If $\nu > 2$, we can think of agricultural trials, where a treatment is attempted with replications and in many different blocks. The alternative hypothesis is that, in a few special blocks, the treatment has a substantial effect, but in most blocks it has no effect.

As it turns out, for any fixed, constant number of degrees of freedom, the results will be similar. Let the noncentrality parameter obey

$$\delta = \delta_n = 2r\log(n), \qquad 0 < r < 1.$$

In this problem, we again have an $(r,\beta)$ plane, in which there is a region of detectability and a detection boundary. There are two key facts:

• The optimal detection boundary $\rho^*_{\chi^2_\nu}(\beta)$ is the same as in the Gaussian case:

$$\rho^*_{\chi^2_\nu}(\beta) = \rho^*(\beta), \qquad \tfrac12 < \beta < 1.$$

This is proven in [23].

• HC* is able to separate the null and alternative hypotheses throughout the interior of the detection region, and thus its detection boundary $\rho_{\mathrm{HC},\chi^2_\nu}$ obeys

$$\rho_{\mathrm{HC},\chi^2_\nu}(\beta) = \rho^*_{\chi^2_\nu}(\beta), \qquad \tfrac12 < \beta < 1.$$

The key ideas are illustrated in Figure 3.

Fig. 3. Three regions in the $r$-$\beta$ plane for the model (5.1)-(5.2). The detection boundary separates the detectable region from the undetectable region. For the estimable region, it is possible not only to detect the presence of nonzero means, but also to estimate those means.

The analysis supporting the performance of HC* is as follows. Define $\chi^2_n(q) = 2q\log(n)$. We then have

$$P\{\chi^2_\nu(0) > \chi^2_n(q)\} = L_n\cdot n^{-q}, \qquad P\{\chi^2_\nu(\delta_n) > \chi^2_n(q)\} = L_n\cdot n^{-(\sqrt{q}-\sqrt{r})^2}, \qquad r < q \le 1. \tag{5.5}$$

The left-hand relation is proved as follows:

$$P\{\chi^2_\nu(0) > \chi^2_n(q)\} = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}\int_{2q\log(n)}^{\infty} p^{\nu/2-1}e^{-p/2}\,dp = \frac{(q\log(n))^{\nu/2-1}}{\Gamma(\nu/2)}\,n^{-q}(1+o(1)).$$

It takes a little more effort to check that the second equality is also true; this follows from the following lemma.

Lemma 5.1. Let $0 < r < q \le 1$. Then

$$P\{\chi^2_\nu(\delta_n) > \chi^2_n(q)\} = \frac{1}{\sqrt{2\pi\log(n)}}\Bigl(\frac{r}{q}\Bigr)^{(1-\nu)/4}\frac{1}{\sqrt{2q}-\sqrt{2r}}\; n^{-(\sqrt{q}-\sqrt{r})^2}(1+o(1)).$$

The proof of Lemma 5.1 is given in the Appendix. Once the relations (5.5) are available, the analysis proceeds exactly as in the earlier Gaussian case. The most informative values of $q$ (resp. $x^2$) are at

if $r < \frac14$, then $q = 4r$ or $x^2 \approx 4\delta_n$,

if $r > \frac14$, then $q = 1$ or $x^2 \approx 2\log(n)$,

correspondingly.
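
As a quick sanity check on the left-hand relation in (5.5), the case $\nu = 2$ can be worked out in closed form, because $\chi^2_2(0)$ is exponential with mean 2. The computation below is an added illustration, not part of the original argument; it shows that the multi-log factor $L_n$ is identically 1 in this case.

```latex
P\{\chi^2_2(0) > 2q\log n\}
  = \frac{1}{2}\int_{2q\log n}^{\infty} e^{-p/2}\,dp
  = e^{-q\log n}
  = n^{-q}.
```

This agrees with the general asymptotic expression, since $(q\log n)^{\nu/2-1}/\Gamma(\nu/2) = 1$ when $\nu = 2$.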

5.2. Generalized Gaussian (Subbotin) distribution. The generalized Gaussian (Subbotin) distribution $\mathrm{GN}_\gamma(\mu)$ has density function

$$\phi_{\gamma,\mu}(x) = \frac{1}{C_\gamma}\exp\Bigl(-\frac{|x-\mu|^\gamma}{\gamma}\Bigr), \qquad C_\gamma = 2\Gamma(1/\gamma)\gamma^{1/\gamma-1}.$$

This class of densities was introduced by Subbotin [29]; see [24], page 195. It has many uses in Bayesian analysis; see [9], page 157, who cite several earlier references. The Gaussian is, of course, the special case $\gamma = 2$. The case $\gamma = 1$ corresponds to the double exponential (Laplace) distribution, a well-understood and widely used distribution. The case $\gamma < 1$ is of interest in image analysis of natural scenes, where it has been found that wavelet coefficients at a single scale can be modeled as following a Subbotin distribution with $\gamma \approx 0.7$ [28]. This suggests that various problems of image detection, such as in watermarking and steganography, could reasonably use the model above.

A natural generalization of (1.3)-(1.4) is the following:

$$H_0: X_i \overset{\text{i.i.d.}}{\sim} \mathrm{GN}_\gamma(0), \qquad 1\le i\le n, \tag{5.6}$$

$$H_1^{(n)}: X_i \overset{\text{i.i.d.}}{\sim} (1-\epsilon)\mathrm{GN}_\gamma(0) + \epsilon\,\mathrm{GN}_\gamma(\mu), \qquad 1\le i\le n. \tag{5.7}$$

Here we choose the calibrations

$$\epsilon_n = n^{-\beta}, \qquad \mu = \mu_{\gamma,n} = (\gamma r\log(n))^{1/\gamma}, \qquad \tfrac12 < \beta < 1, \quad 0 < r < 1.$$

First we will discuss the case $\gamma > 1$. In this range the number of nonzero means is too small to be noticeable in any sum which is of expectation of order $n$; if $r$ is not large, we cannot expect a visible effect on the upper extreme. In short, this detection problem will be a difficult problem.

For $\gamma > 1$, we showed in [23] that the detection boundary is defined as

$$\rho^*_\gamma(\beta) = \begin{cases} (2^{1/(\gamma-1)}-1)^{\gamma-1}(\beta-\tfrac12), & \tfrac12 < \beta \le 1-2^{-\gamma/(\gamma-1)},\\ (1-(1-\beta)^{1/\gamma})^{\gamma}, & 1-2^{-\gamma/(\gamma-1)} < \beta < 1.\end{cases} \tag{5.8}$$




Fig. 4. Left: Detection boundaries of the $r$-$\beta$ plane for model (5.6)-(5.7) for $\gamma \le 1$ and $\gamma = 1.5, 2, 3$ from top to bottom. Solid parts of the curves are line segments. Right: The common detection boundary for $0 < \gamma \le 1$, which separates the detectable region from the undetectable region. Three curves from top to bottom correspond to the detection
boundaries of the Bonferroni method with 7 = ^, 7 = f ctnd 7 = 1. 



Now we discuss the case $\gamma \le 1$. Jin [23] showed that there is a threshold effect for the likelihood ratio test, but the detection boundary is quite different, and, surprisingly, it can be described in terms of $(r,\beta)$ independent of $\gamma$:

$$\rho^*_\gamma(\beta) = 2(\beta-\tfrac12), \qquad \tfrac12 < \beta < 1. \tag{5.9}$$
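
The two boundary formulas can be combined into one small Python function, which is convenient for plotting or for checking a given $(r,\beta,\gamma)$ triple. This is an illustrative sketch under the reading that (5.8) applies for $\gamma > 1$ and (5.9) for $0 < \gamma \le 1$; the function name `detection_boundary` is hypothetical.

```python
def detection_boundary(beta, gamma):
    """rho*_gamma(beta): (5.8) for gamma > 1, (5.9) for 0 < gamma <= 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    if gamma <= 1.0:
        return 2.0 * (beta - 0.5)
    # Note: for gamma extremely close to 1 the powers of 2 below can overflow.
    split = 1.0 - 2.0 ** (-gamma / (gamma - 1.0))
    if beta <= split:
        return (2.0 ** (1.0 / (gamma - 1.0)) - 1.0) ** (gamma - 1.0) * (beta - 0.5)
    return (1.0 - (1.0 - beta) ** (1.0 / gamma)) ** gamma
```

With $\gamma = 2$ the function reduces to the Gaussian boundary, $\beta - \frac12$ for $\beta \le \frac34$ and $(1-\sqrt{1-\beta})^2$ beyond, and the two branches of (5.8) agree at the breakpoint.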

We have the following result. 

Theorem 5.1. Consider applying higher criticism to the p-values $p_i = P\{\mathrm{GN}_\gamma(0) > X_i\}$, $i = 1,\dots,n$, in the setting just described. Then the detection boundary $\rho_{\mathrm{HC},\gamma}$ for this procedure is the same as the efficient detection boundary:

$$\rho_{\mathrm{HC},\gamma}(\beta) = \rho^*_\gamma(\beta), \qquad \tfrac12 < \beta < 1.$$

The basic phenomena are depicted in Figure 4.

The analysis can be made very similar to the normal case. Specifically, introduce the notation $z_{\gamma,n}(q) = (\gamma q\log(n))^{1/\gamma}$ and let $\pi_{\gamma,n,q} = P\{\mathrm{GN}_\gamma(0) > z_{\gamma,n}(q)\}$, $0 < q \le 1$. Note that, when $\gamma = 2$, $z_{\gamma,n}(q) = z_n(q)$ and $\pi_{\gamma,n,q} = p_{n,q}$. We have

$$\max_{0<t<1/2} W_n(t) = \max_{0<q<\infty} W_n(\pi_{\gamma,n,q}).$$

Similarly, let $N_{\gamma,n}(q)$ count the observations exceeding $z_{\gamma,n}(q)$:

$$N_{\gamma,n}(q) = \#\{i : X_i \ge z_{\gamma,n}(q)\}$$

and also

$$V_{\gamma,n}(q) = \frac{N_{\gamma,n}(q) - n\pi_{\gamma,n,q}}{\sqrt{n\pi_{\gamma,n,q}(1-\pi_{\gamma,n,q})}}.$$

By arguments which are obvious at this point,

$$P\{\mathrm{GN}_\gamma(0) > z_{\gamma,n}(q)\} = L_n\cdot n^{-q}, \qquad P\{\mathrm{GN}_\gamma(\mu_{\gamma,n}) > z_{\gamma,n}(q)\} = L_n\cdot n^{-(q^{1/\gamma}-r^{1/\gamma})^{\gamma}}, \qquad r < q \le 1.$$

It follows that, under the alternative $H_1^{(n)}$, we have

$$EV_{\gamma,n}(q) = L_n\cdot\frac{n^{1-\beta-(q^{1/\gamma}-r^{1/\gamma})^{\gamma}}}{\sqrt{n^{1-q}}} = L_n\cdot n^{[(1+q)/2-\beta-(q^{1/\gamma}-r^{1/\gamma})^{\gamma}]},$$

while under the null $EV_{\gamma,n}(q) = 0$. The most informative value of $q$ will optimize the growth rate of $n^{[(1+q)/2-\beta-(q^{1/\gamma}-r^{1/\gamma})^{\gamma}]}$ to $\infty$.

For the case $\gamma > 1$, define

$$r_\gamma = (1-2^{-1/(\gamma-1)})^{\gamma}.$$

The most informative value of $q$ satisfies

if $r < r_\gamma$, then $q = r/r_\gamma$ and $EV_n(q) = L_n\cdot n^{[(2^{1/(\gamma-1)}-1)^{-(\gamma-1)}r-(\beta-1/2)]}$,

if $r > r_\gamma$, then $q = 1$ and $EV_n(q) = L_n\cdot n^{[1-\beta-(1-r^{1/\gamma})^{\gamma}]}$.

For the case $0 < \gamma \le 1$, the story is quite different, and the main reason is that

$$\frac{1+q}{2} - \beta - (q^{1/\gamma}-r^{1/\gamma})^{\gamma},$$

as a function of $q$, is strictly decreasing for any fixed $0 < \gamma \le 1$, so the most informative place to look is at

$q = r$ or, equivalently, at $x = \mu_{\gamma,n}$.

Notice that, under $H_0$, HC* behaves the same as in the normal case. Under $H_1^{(n)}$, the above analysis shows the behavior at the most informative place. We can argue exactly as in the proof of Theorem 1.2. The growth of $EV_{\gamma,n}(q)$ easily surpasses the $\sqrt{4\log\log(n)}$ threshold, and the result follows.

There are some interesting points here.

First, the detection boundary for all the cases where $\gamma \le 1$ looks like the limit of the boundaries for $\gamma > 1$ as $\gamma \to 1$. Second, the most informative place to look, for the case $\gamma > 1$, is at

$$x = \frac{1}{1-2^{-1/(\gamma-1)}}\,\mu_{\gamma,n};$$

the coefficient $1/(1-2^{-1/(\gamma-1)}) \to 1$ as $\gamma \to 1$; in comparison, for the case $\gamma \le 1$, the most informative place to look is at $x = \mu_{\gamma,n}$.

Third, it is interesting to notice that, for the case $\gamma \le 1$, the best that either the maximum or the FDR-controlling methods can obtain is

$$r > (1-(1-\beta)^{1/\gamma})^{\gamma}. \tag{5.10}$$

This is strictly above the detection boundary as defined in (5.9) for any $\frac12 < \beta < 1$, while, in comparison, higher criticism can obtain the full interior of the region of detectability, for all $\gamma$. Fourth, notice that the performance of the maximum or FDR-controlling methods worsens compared to HC* when $\gamma \to 0$. The best that the maximum or FDR-controlling methods can do when $\gamma \approx 0$ is to detect for $r > 1$, while higher criticism is able to detect for $r > 2\beta - 1$, $\frac12 < \beta < 1$, independent of $\gamma$; the superiority of HC* can be seen most prominently for the case $\beta \approx \frac12$, $\gamma \approx 0$, in which HC* is able to detect for $r > 2\beta - 1 \approx 0$, while the maximum or FDR-controlling methods are able to detect only for $r > 1$. Compare Figure 4.

APPENDIX: PROOFS 
A.l. Proof of Theorem 1.3. 

Lemma A.2. If $Z_n \sim \mathrm{Binomial}(n,\pi_n)$ and $\pi_n \to 0$, $n\pi_n \to \infty$, then $P\{Z_n \ge 1\} \to 1$.

Proof.

$$P\{Z_n = 0\} = (1-\pi_n)^n = e^{n\log(1-\pi_n)} \to 0. \qquad \Box$$

Proof of Theorem 1.3. When $r > \rho_{\mathrm{Max}}(\beta)$ or, equivalently, $1-\beta > (1-\sqrt{r})^2$, we can pick a constant $c > 0$ depending only on $(r,\beta)$ such that

$$1-\beta > \bigl(\sqrt{1+c}-\sqrt{r}\bigr)^2.$$

To prove Theorem 1.3, it is sufficient to prove

$$P_{H_1^{(n)}}\bigl\{M_n > \sqrt{2(1+c)\log(n)}\bigr\} \to 1 \quad \text{as } n\to\infty. \tag{A.11}$$

Let

$$N(c) = \#\bigl\{i : X_i > \sqrt{2(1+c)\log(n)}\bigr\}.$$

Then, under $H_1^{(n)}$, $N(c) \sim \mathrm{Binomial}(n, q_{n,c})$, where

$$q_{n,c} = P\bigl\{(1-\epsilon_n)N(0,1) + \epsilon_n N(\mu_n,1) > \sqrt{2(1+c)\log(n)}\bigr\}.$$

Notice that, as $n\to\infty$, $q_{n,c}\to 0$ and $nq_{n,c}\to\infty$. So letting $\pi_n = q_{n,c}$ in Lemma A.2, we have

$$P_{H_1^{(n)}}\{N(c) \ge 1\} \to 1 \quad \text{as } n\to\infty.$$

Finally, the desired result (A.11) follows from

$$P_{H_1^{(n)}}\bigl\{M_n > \sqrt{2(1+c)\log(n)}\bigr\} = P_{H_1^{(n)}}\{N(c) \ge 1\}. \qquad \Box$$

A.2. Proof of Theorem 1.4. 

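Lemma A.3. For constants $\frac12 < \delta < 1$, $0 < a < 1$,

$$\sup_{\{t:\, t\ge n^{-\delta}\}} P\{\mathrm{Binomial}(n,t) \ge n\cdot t/a\} \le 2e^{-c(a)n^{1-\delta}}, \qquad n\to\infty,$$

where $c(a) = 1-(\log a+1)/a$.

Proof. By noticing that

$$\sup_{\{t:\, t\ge n^{-\delta}\}} P\{\mathrm{Binomial}(n,t) \ge n\cdot t/a\} \le P\Bigl\{\sup_{\{t:\, t\ge n^{-\delta}\}} \frac{\mathrm{Binomial}(n,t)}{nt} \ge \frac{1}{a}\Bigr\},$$

Lemma A.3 follows directly from [34], Lemma 1. □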
Now for any ^ < 6 <1 introduce statistics: 

= mm —7^, Fn = mm —7^. 

{i:p(,)>n-^}l/n {i.p^^)<n-^} l/n 

Lemma A. 4. For any constants ^<5<1, 0<a<l, if r < pfdr{P), 
then, as 00, 

(A. 12) PH,{Ff < a} ^ 0, P^(„) {Ff < a} ^ 0. 

Proof. Recall that nFn{t) = J2i^{p,<t}: <t <1. We have nFn{t) ~ 
Binomial(n, t) under Hq and nFn{t) ~ Binomial(n, ir{n, t)) under h["'^ , where 

7r(n,t) = Prr{n){pi < t} > t] sluce r < pfdr(/5) we also have 
^1 

On = sup 7r{n,t)/t —> 1. 
{t>n-s} 

Observe that i/n = Fn (p^^i) ) , so 
by Lemma A.3, the desired result in (A.12) follows from

PHo{Ff <a}<n- sup P{Binomial(n, t)>n- t/a} 0, 

{i: t>n-*} 

P (n){Ff <a}<n- sup P{Binomial(n, 7r(n, t)) > n ■ t/a} -^0. □ 

1 {t:t>n-^} 

Proof of Theorem 1.4. For the FDR-controhing procedure, 

Pa) 

Reject if and only if min — — < h(n,an), 

l<i<n i/n 

where h{n,an) <a<l is any given critical value. Since the attainability of 
the FDR-controlling procedure is as good as the maximum or Bonferroni 
method, all we need to prove is that, for {r,f5) in the region p*{P) <r < 
(1 — Vl ~ the FDR-controlling method totally fails or 

(A. 13) i-*Hg{Reject Hq} + P („) {Accept Hq} 1 as n — > oo. 

Now, under h[^^ we break {1, 2, . . . , n} into two sets A^^^ and A^^^^ , where 

i € A^^ if Xi is sampled from A^(0, 1), 

i G A'^^ if Xi is sampled from N{^n, !)• 
Introduce an event: 

Ef[> = < n-'^o for some i G A^"^}. 

Smce r < we can choose Sq to be close 

enoug h to 1 such that P (n){E^°) 0. Notice that 



{Fi^\Ho}^{F^^'\{H[''\{Etr)} 



and 



P^(„,{F|«</i(n,a„)} 



{I- P^,,.,{Ef^'))P^MF^^^ <h{n,an)\{Ef^r} 
+ P^(„)(i?;^°)P^(„){F|" < h{n,an)\Ei°}, 



1 

so Pho{F2' < hin,an)} - P^(n){F|° < /i(n,a„)} ^ 0. 
Finally, by Lemma A. 4 

|PHo{Reject} - P^(„) {Reject}! 

< \Pho{F^" < h{n,an)} - P^(n){F|° < /i(n,a„)}| 
+ \Pho{F^" < Kn,an)}\ + < h{n,an)}\ 

and the desired result in (A. 13) follows. □ 

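A.3. Proof of Theorem 1.5. Under $H_0$, $F_n \sim \chi^2_{2n}$, and $EF_n = 2n$, $\mathrm{Var}(F_n) = 4n$. Under $H_1^{(n)}$ with $\epsilon_n = n^{-\beta}$, $\mu_n = \sqrt{2r\log(n)}$, $\frac12 < \beta < 1$, $0 < r < 1$, direct calculations show that $2n \le EF_n = 2n[1+O(\epsilon_n L_n)]$, $\mathrm{Var}(F_n) = 4n[1+O(\epsilon_n L_n)]$; since $\beta > \frac12$, the conclusion follows.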

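A.4. Proof of Theorem 1.6. To prove Theorem 1.6 we need the following lemma.

Lemma A.5. (i) For $0 < x < t < \frac12$,

$$K^+(t,x) \le \frac12\,\frac{(t-x)^2}{x(1-x)}. \tag{A.14}$$

(ii) Let $x = x(t)$ obey $0 < x < t < 1$. We have, as $t\to 0$,

$$K^+(t,x) = \begin{cases} \dfrac12\,\dfrac{(t-x)^2}{x(1-x)}\,(1+o(1)), & \text{if } t/x \to 1,\\[2mm] t\log\dfrac{t}{x}\,(1+o(1)), & \text{if } t/x \to \infty.\end{cases} \tag{A.15}$$

Proof. (i) Letting $t = sx$, it is sufficient to prove that, for fixed $0 < x < \frac12$ and for $1 < s < \frac{1}{2x}$,

$$s\log s + \Bigl(\frac{1-sx}{x}\Bigr)\log\Bigl(\frac{1-sx}{1-x}\Bigr) \le \frac12\,\frac{(s-1)^2}{1-x}. \tag{A.16}$$

To prove (A.16), set

$$f(s) = s\log s + \Bigl(\frac{1-sx}{x}\Bigr)\log\Bigl(\frac{1-sx}{1-x}\Bigr) - \frac12\,\frac{(s-1)^2}{1-x};$$

direct calculations show that $f(1) = 0$, $f'(1) = 0$ and

$$f''(s) = -\frac{(s-1)(1-(s+1)x)}{s(1-x)(1-xs)}.$$

Notice that when $s > 1$, $1-(s+1)x > 1-2sx$, so, for any fixed $0 < x < \frac12$,

$$f''(s) < 0, \qquad 1 < s < \frac{1}{2x}.$$

This proves (A.16).

(ii) The case $t/x \to \infty$ is obvious. For the case $t/x \to 1$, notice that $0 < x < t < 1$ and

$$\log\frac{t}{x} = \frac{t-x}{x} - \frac12\Bigl(\frac{t-x}{x}\Bigr)^2 + O\Bigl(\Bigl(\frac{t-x}{x}\Bigr)^3\Bigr), \qquad \log\frac{1-t}{1-x} = -\frac{t-x}{1-x} - \frac12\Bigl(\frac{t-x}{1-x}\Bigr)^2 + O\Bigl(\Bigl(\frac{t-x}{1-x}\Bigr)^3\Bigr),$$

so

$$K^+(t,x) = t\log\frac{t}{x} + (1-t)\log\frac{1-t}{1-x} = \frac{(t-x)^2}{2x(1-x)}\Bigl[1 + O\Bigl(t + \frac{t-x}{x}\Bigr)\Bigr]. \qquad \Box$$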
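
Part (i) of the lemma is easy to spot-check numerically. The Python sketch below assumes that, for $x < t$, $K^+(t,x)$ coincides with the Bernoulli Kullback-Leibler divergence $t\log(t/x) + (1-t)\log((1-t)/(1-x))$ used in the proof; the helper names `kl_bernoulli` and `chi_square_bound` are hypothetical.

```python
import math

def kl_bernoulli(t, x):
    """K(t, x) = t log(t/x) + (1 - t) log((1 - t)/(1 - x))."""
    return t * math.log(t / x) + (1.0 - t) * math.log((1.0 - t) / (1.0 - x))

def chi_square_bound(t, x):
    """Right-hand side of (A.14)."""
    return 0.5 * (t - x) ** 2 / (x * (1.0 - x))

# Grid over 0 < x < t < 1/2; the largest value of K - bound should be negative.
grid = [k / 200.0 for k in range(1, 100)]
worst = max(kl_bernoulli(t, x) - chi_square_bound(t, x)
            for x in grid for t in grid if x < t)
print(worst)
```

A grid check of this kind is no substitute for the convexity argument in the proof, but it does catch transcription errors in the bound.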

Proof of Theorem 1.6. From (A.14) we have

$$\mathrm{BJ}_n^+ \le \tfrac12(\mathrm{HC}_n^*)^2,$$

so under $H_0$ the behavior of BJ+ is well controlled.

Now we consider the behavior of BJ+ under $H_1^{(n)}$. We examine the cases $r < \frac14$ and $r > (1-\sqrt{1-\beta})^2$ separately; these two cases overlap and together cover the full region $\frac12 < \beta < 1$, $r > \rho^*(\beta)$.

First, for the case ?^ < f notice that r <j. Take ro such that < ro < r; 
as in Lemma A. 4, it is easy to prove that, under 



(n) 



max 

{j:n-4'-<P(i)<n-*''o} 



i/n 



Introduce the following statistic: 
HC 



max 



in probability. 



r-^ — I max< Vra 



Now, from (A. 15) 
and so 

BJ+> 

{i:n-i'-<P(i)<n-4'-0} 



i/n - P{i) 

V^m(i-m)' 



if 



i/n 
P(i) 



(A.17)BJ+> max nK+(Vn,P(,)) = i[HC*,j'(l + o(l)). 
Thus, in this case, BJ+ is able to separate $H_0$ and $H_1^{(n)}$.

For the second case notice that $(r+\beta)/(2\sqrt{r}) < 1$. Pick a constant $q$ such that $\max\{(r+\beta)/(2\sqrt{r}), \sqrt{r}\} < \sqrt{q} < 1$. Observe that under $H_1^{(n)}$,

$$\#\{i : p_i \le n^{-q}\} \sim \mathrm{Binomial}\bigl(n, L_n n^{-[\beta+(\sqrt{q}-\sqrt{r})^2]}\bigr).$$

This implies that, under $H_1^{(n)}$, for those p-values $p_{(i)} \approx n^{-q}$, $(i/n)/p_{(i)} \gg 1$; so, from (A.15),

$$\mathrm{BJ}_n^+ \ge nK^+(i/n, p_{(i)}) \sim L_n\,\mathrm{Binomial}\bigl(n, L_n n^{-[\beta+(\sqrt{q}-\sqrt{r})^2]}\bigr).$$

As $1-\beta-(\sqrt{q}-\sqrt{r})^2 > 0$, BJ+ is able to separate $H_1^{(n)}$ and $H_0$. □

Remark. For the second case HC* is more powerful than BJ+. In fact, the best BJ+ can do is to choose $i$ as large as possible while keeping $(i/n)/p_{(i)} \gg 1$. This is roughly equivalent to choosing $p_{(i)} \approx n^{-q}$ with $q$ satisfying

$$n^{1-\beta}n^{-(\sqrt{q}-\sqrt{r})^2} > n^{1-q} \iff \sqrt{q} > (r+\beta)/(2\sqrt{r}).$$

As a result, $\mathrm{BJ}_n^+ \approx L_n n^{1-(r+\beta)^2/(4r)}$. To see the main idea, take the region $\frac12 < \beta < \frac34$, $\rho^*(\beta) < r < 1-\beta$ for comparison. For $(r,\beta)$ in this range, $(\mathrm{HC}_n^*)^2 \approx L_n n^{2r-2\beta+1}$. Since $1-(r+\beta)^2/(4r) < 2r-2\beta+1$, which is equivalent to $(3r-\beta)^2 > 0$, HC* has better performance than BJ+.

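A.5. Proof of Lemma 5.1. With $z_j \overset{\text{i.i.d.}}{\sim} N(0,1)$, $\delta_n = \mu_n^2 = 2r\log(n)$, $\chi^2_\nu(\delta_n) =_D z_1^2 + z_2^2 + \cdots + (z_\nu+\mu_n)^2$, so

$$P\{\chi^2_\nu(\delta_n) > 2q\log(n)\} = P\{z_1^2 + z_2^2 + \cdots + (z_\nu+\mu_n)^2 > 2q\log(n)\}$$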



TTT^ / COS-'-' e,p''~'e-^'/'d9rdp 

7r/2 p2-K 

COS^-^ 02 • • • / (i^;.-! 
-7r/2 JO 

2-1^/2+1 



cos''^^ 6*1 dOi 



VvFr((l/ - l)/2) 7|ei|<,r/2 

^ p-'^e-P'/'dp, 

[- V2r sin 6*1 + ^2q-2T cos^ Si ] -^log(n) 

ZL 
2 • 



where ^(^1,^) = {(0i,p) : |0i| < f + 2p;U„sin0i > 2{q - r)log(n)}. 
The case $\nu = 1$ is obvious, while for $\nu = 2$ we have
$P\{\chi^2_2(\delta_n) > 2q\log(n)\}$

q-r cos^ 9i-^sin9i]'-^ 



|ei|<7r/2 
1 



1 



n 

1 



-1/4 



dx(l + o(l)) 



■ n 



[v^-^]'(l + 0(l)). 



V2^1og(n) y^-V2^\q, 

Now consider the case $\nu \ge 3$. Notice that, for fixed $r$ and $q$ but large $n$,
j^-[-v/rsm0i+-v/'3-^cos2 obtains its maximum rate of growth at = ^, and 
for 01 w f , 

[— \/rsin0i + Vg — rcos^i]^ 



vr 



moreover, notice that for any y ^ oo, 

r p-'^e-P'/^ dp = y--\-y"l\\ + o(l)). 
These enable us to write 



|<7r/2 



cos'^^^ 01 dBi 



[- v^sin 6»i + -y/2<j-2r cos^ 6»i ] \/log(n) 



([V2^- \/2^]Vlog(ri)) 



COS 



0<6li<7r/2 



([V2^-\/2^]Vlog(n))''~^ 
([V2^- V2^]Vlog(n))''"^ 

"(1 - x)(^-3)/2n-(V^^^+^-v^^)' dxl (1 + o(l)). 



(1 + 0(1)) 
To evaluate this integral, notice that



:i-x) 



(^-3)/2^-(Vg-r+r-x2-v^x)2 



dx 



^ 2;(^-3)/2^-{A/'?-2r-x+rx2-yf+v^x)2 



: n 



dx 



(1+0(1)) 



log(n) 



n 



log(n) 



lo Vlog(n), 
(log(n))(^-^)/2^-(^-v^)' 



[1 + 0(1)) 



,{u-3)/2-^/7/^i,/2^-V2^) 





u-1 



(V2g- V2r)^log(n) 

/ Lv y 
Finally, we have

$$P\{\chi^2_\nu(\delta_n) > 2q\log(n)\} = \frac{1}{\sqrt{2\pi\log(n)}}\Bigl(\frac{r}{q}\Bigr)^{(1-\nu)/4}\frac{1}{\sqrt{2q}-\sqrt{2r}}\; n^{-(\sqrt{q}-\sqrt{r})^2}(1+o(1)). \qquad \Box$$